Exploring the Dataset

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. Historically Prosper made their loan data publicly available, but effective January 2015 this information will be made available 45 days after the end of each quarter.

Structure of variables

ListingCategory and the 2 ProsperRatings had some problematic characters that were converted to “NA”. Converted LoanOriginationDate and ListingCreationDate to recognizable Date classes. CreditScore is represented by an “Upper” and “Lower” value providing a range. All original data categories below:

[1] "ListingKey"                          "ListingNumber"                      
 [3] "ListingCreationDate"                 "CreditGrade"                        
 [5] "Term"                                "LoanStatus"                         
 [7] "ClosedDate"                          "BorrowerAPR"                        
 [9] "BorrowerRate"                        "LenderYield"                        
[11] "EstimatedEffectiveYield"             "EstimatedLoss"                      
[13] "EstimatedReturn"                     "ProsperRating..numeric."            
[15] "ProsperRating..Alpha."               "ProsperScore"                       
[17] "ListingCategory..numeric."           "BorrowerState"                      
[19] "Occupation"                          "EmploymentStatus"                   
[21] "EmploymentStatusDuration"            "IsBorrowerHomeowner"                
[23] "CurrentlyInGroup"                    "GroupKey"                           
[25] "DateCreditPulled"                    "CreditScoreRangeLower"              
[27] "CreditScoreRangeUpper"               "FirstRecordedCreditLine"            
[29] "CurrentCreditLines"                  "OpenCreditLines"                    
[31] "TotalCreditLinespast7years"          "OpenRevolvingAccounts"              
[33] "OpenRevolvingMonthlyPayment"         "InquiriesLast6Months"               
[35] "TotalInquiries"                      "CurrentDelinquencies"               
[37] "AmountDelinquent"                    "DelinquenciesLast7Years"            
[39] "PublicRecordsLast10Years"            "PublicRecordsLast12Months"          
[41] "RevolvingCreditBalance"              "BankcardUtilization"                
[43] "AvailableBankcardCredit"             "TotalTrades"                        
[45] "TradesNeverDelinquent..percentage."  "TradesOpenedLast6Months"            
[47] "DebtToIncomeRatio"                   "IncomeRange"                        
[49] "IncomeVerifiable"                    "StatedMonthlyIncome"                
[51] "LoanKey"                             "TotalProsperLoans"                  
[53] "TotalProsperPaymentsBilled"          "OnTimeProsperPayments"              
[55] "ProsperPaymentsLessThanOneMonthLate" "ProsperPaymentsOneMonthPlusLate"    
[57] "ProsperPrincipalBorrowed"            "ProsperPrincipalOutstanding"        
[59] "ScorexChangeAtTimeOfListing"         "LoanCurrentDaysDelinquent"          
[61] "LoanFirstDefaultedCycleNumber"       "LoanMonthsSinceOrigination"         
[63] "LoanNumber"                          "LoanOriginalAmount"                 
[65] "LoanOriginationDate"                 "LoanOriginationQuarter"             
[67] "MemberKey"                           "MonthlyLoanPayment"                 
[69] "LP_CustomerPayments"                 "LP_CustomerPrincipalPayments"       
[71] "LP_InterestandFees"                  "LP_ServiceFees"                     
[73] "LP_CollectionFees"                   "LP_GrossPrincipalLoss"              
[75] "LP_NetPrincipalLoss"                 "LP_NonPrincipalRecoverypayments"    
[77] "PercentFunded"                       "Recommendations"                    
[79] "InvestmentFromFriendsCount"          "InvestmentFromFriendsAmount"        
[81] "Investors" 

Univaritate Analysis

Depiction of the Data

LoanOriginalAmount exhibits positive skew, mean 8300 > median 6500.

MonthlyLoanPayment exhibits positive skew, mean 272 > median 217.

CreditScoreRangeLower is normally distributed with mean/med = 685/680. It seems there are some common intervals and tranches.

BorrowerRate interest rates are normally distributed mean/med = .19 However, there is a spike around .31.

There is a huge concentration between interest rates of 31.77% and 31.99%.

Distribution of the Interest rates(BorrowerRate) by Term. Most of loans’ Term is 36 months, with the median Interest rate increasing sharply from 12 to 36 months and slightly increasing to 60 months, reflecting an upward sloping borrowing curve.

DebtToIncome ratio exhibits slight positive skew mean .27 > median .22. There are more people with DebtToIncome < 0.5.

StatedMonthlyIncome exhibits positive skew mean 5600 > median 4666.

I created the variable StatedMonthlyDebt to estimate the monthly debt at inception of loan by multiplying the DebtToIncome ratio by StatedMontlyIncome. The resulting distribution is positively skewed.

EmploymentStatusDuration exhibits a strong positive skew mean 96 > median 67. More people that are employed for shorter durations need Prosper loans.

CurrentCreditLines has a slight positive skew and the mean/med = 10.

Moving to the date when loans originate, there is seasonality in the origination of loans reflecting the holiday season. It is also interesting to plot loans by year, with a significant drop in 2009 and steady uprise as the economy is improving.

Number of Loans by State

Here we see a nice graphical representation of the US map, which shows the number of loans where they originated from. In the chart right below we see the number of loans by State and their Term

Plotting ListingCategory..numeric.’, shows that Debt consolidation by far is the leading loan category in number of loans and $ amount. But it seems there are a lot of loans not classified.

LoanStatus and EmploymentStatus are 2 other categories worth exploring.

Now let’s see how the loans are distributed in terms of their ProsperRating. The number of loans by Prosper rating seems to be normally distributed, with a minor positive skew towards higher rated loans.

Finally the number of Investors per loan seem to be heavily positive skewed. Performing a log transformation to restore symmetry results in a much more normal distribution, indicating that there’s really a lognormal distribution.

Bivariate Analysis

We should further investigate the LoanStatus of the loans. We see that default rates were significantly reduced after Prosper’s 2009 relaunch. Current loans span from years 2011-2014. Since the majority of the loans created have a 36-month term, they would still be placed in the Current.

The majority of the loans were issued in 2013, with issuance growth being exponential, and the number of Current loans far outnumbering the Completed or Chargedoff ones.

Breaking up further the Issuance of loans by day, we see a cyclical pattern of loan offerings. There’s a spike in offerings around the first week of the month and again the second week of the month and then a gradual decrease towards the end of the month. This can be attributed to the fact that debt consolidation makes sense when then payment for credit card and other loans is due within the first 2 weeks of every month.

Further we can look at the seasoning of the loans meaning the date when they were initiated and what is their relative interest(BorrowerRate, EffectiveYield) and loan amount(mean/median LoanOriginalAmount).

In addition we can investigate the relationship between the LoanOriginalAmount and number of Investors which shows a significant amount of loans sponsored from few investors dominant primarily for smaller size loans than larger size loans. In addition 60 month term loans are more dominantly sponsored by fewer investors compared to 36 month term loans.

Next let’s see the relationship between Risk and Return and how it varies by Term. As ProsperRating..numeric. is a discrete variable and BorrowerRate is continuous, the chart without the jitter would look cluttered. There is a strong negative correlation(-0.953) between the return investors want and the credit quality of the borrower. Interestingly enough the return investors seek for some 60 month term loans for riskier borrowers is lower than for 36 month term loans which could be attributed to either mispricing or an expectation that the subprime borrower will not default by having a greater period for the return of of the Principal of the loan.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperRating..numeric. and BorrowerRate
## t = -917.37, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9537172 -0.9524846
## sample estimates:
##        cor 
## -0.9531049

How about the Risk and Return of borrowers with different StatedMonthlyIncome and LoanOriginalAmount? I am filtering StatedMonthlyIncome abover 0 and below 20,000 and it appears there is a positive correlation of 0.233 between Credit score and Income. Morevover, the average LoanOriginalAmount increases with Credit quality and stays flat above 5 ProsperRating..numeric.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperRating..numeric. and StatedMonthlyIncome
## t = 68.944, df = 83149, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2260988 0.2389577
## sample estimates:
##       cor 
## 0.2325384

Now let’s look more closely at DebtToIncome with respect to ProsperRating..numeric. and BorrowerRate. There is a negative correlation between Credit quality and Debt/Income of -0.135 as we would have expected. By depicting the boxplots of the DebtToIncome ratio to BorrowerRate we see that there is a positive correlation, whereas with regards to StatedMonthlyIncome there is a negative correlation, as expected.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperRating..numeric. and DebtToIncomeRatio
## t = -37.385, df = 76698, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1407221 -0.1268213
## sample estimates:
##        cor 
## -0.1337783

Finally, in the Bivariate analysis it is worth investigating the relationship of Risk and Defaults. First we find that the percentage of borrowers that didn’t pay back the loans (In Default) is greater at worst risk scores (close to 25%). The Result of the loan (paid or defaulted) is not really correlated with the other variables. I’m surprised at the high correlation between ProsperRating and the LoanAmount (0.46).

In this bivariate section I explored several variables, such as Origination Date, Original Loan Amount, Stated Monthly Income, Debt To Income Ratio, Prosper Credit Rating, Default Rate. Many of them presented some correlation among each other, others barely did so.

There seems to be a good correlation among Prosper Rating with Default Rates, Original Loan Amount and Debt To Income Ratio. In addition, there is strong significant correlation between Stated Monthly Income and Borrower Rate, with of Debt amounts reaching up to the level of Income of the borrowers.

Multivariate Analysis

Investigating further the Interest Rate, Term and Debt Ratio relationship. Loans with term of 36 months seem to be dominating the borrowers with higher DebtToIncome ratios as well as capture the highest range of BorrowerRate. This is probably attributed to the fact that these loans are more commonly issued.

Expanding on the analysis of DebtToIncome where we left off in the Bivariate section, we can futher investigate the distribution of Tranches in which we divided it by Term. What is evident is that 3 different Terms of Loans have a small number of very highly levered DebtToIncome Ratios accounting for less than 15% with Debt-to-Income>1.5. Further focus in the highly levered segment see consistency on the highly levered borrowers and the very high levered borrowers.

Now looking at the DebtToIncome ratio we see that the loans with 12 month of Term, have much smaller DefaultRates as well as BorrowRates. As for 36 months we should have had the second lowest DefaultRates yet in two years 2011 and 2012 the DefaultRate is higher than the 60 month term loan. However this was also reflected via lower BorrowerRates under the 60 month term vs 36 month term, for certain loans issued in 2011 and 2012. Bear in mind that the QE programs in the US were keeping the 5 year rate of the yield curve at historically low levels, while short term volatility events such as the Europe crisis and the Debt ceiling were holding up the short term of the yield curve.

Final Plots and Summary

In this first two plot I am presenting below the relationship on a monthly basis between DefaultedLoans and BorrowerRate. It is very interesting to observe how the percentage of defaulted loans increases from February (6%) to March (9%) and peaks in April (12%). The Interest rate requested from Borrowers also increases during these months from 15% to 20%. In contrast January, February and December are the months with the smallest Interest rates and equivalently the percentage of Defaulted loans is the smallest. Therefore, we see in the chart below that as the percentage of defaulted loans increases, the average interest rate also increases.

In the second plot below plots we see the relationship and possibly lagged effect that the Median Interest rate charged to Borrowers has to Defaulted loans. Again there is very close relationship between the 2 variables across similar Terms. For Term of 60 months, Default rate increased from 2012 (20%) to 2013 (33%) and decreased in 2014 (29%). By comparison, the defaulted ratio with term of 36 month stayed stable in 2012 (23%) and 2013 (22%) and decreased in 2014 (18%).

Lastly in the thrid plot below the BorrowerRate, for 60 month Term followed a similar rize from 2011(17%) to 2012(22%), fell slightly in 2013(20%) and rose significantly in 2014(25%) . For Term of 36 months, the BorrowerRate was stable in 2011(22%) and 2012(21%), whereas it decreased in 2013(17%). It seems that there is a lagged effect that the BorrowerRate has on DefaultedRatio, as a yearly increase or decrease in the BorrowerRate if followed by one from DefaultedRatio the next year. As such the impact on loans that do not season, the ones with 12 months term, never rose above 7%.

These two figures help us understand the relationship between DefaultedRatio and BorrowerRate and how we can predict such variables vian an econometric model.

Reflection

The data set had 113,937 loans from Nov 2005 - March 2014, with total LoanOriginalAmount of $949,894,347. Obviously there are many difficulties in trying to understand, seeing what relationships would construct an interesting picture. I tried to begin by selecting 10-15 important variables from the list. With those I aimed to create a series of histograms, scatterplots, and multivariate figures. However, I was ending up making simple plots, that were not telling a coherent story. In addition I had trouble figuring out what type of chart would tell the better story and how could I edit the data to present meaninful information. In particular it was very hard figuring out how to tread fillers, factor levels, outliers and facet-wraps.

In the meantime dealing with R syntax was a challenge as I had used it in the past for statistical analysis but not for graphical representations. I suffered trying to group and reshape the data set, and until to become familiarized with the data functions and transformations and how to use the libraries available.

During my analysis, I verified that many variables that I have believed that would have a greater effect on BorrowersRate and DefaultRatio and DebtToIncome ratio, but I would like to continue my work from here, to make a predictive model and compare against the Foward data. I could use logistic regression, but it would require some freshing up on my Econometrics experience.I look forward to learning more about modeling and predictions in future classes.